Add evals/: schema-rejection and tool-retrieval regression coverage #62
Merged
Two cheap, runnable evals that turn behavior we care about into numbers we can re-measure on every PR:

- evals/schema_rejection/: 21 calls (19 deliberately malformed, 2 baselines). Classifies each outcome by layer (schema / IO / runtime / silent). Headline number is caught_rate. Currently 94.7% with 1 silent pass (plot_dataset with plot_type='variable' but no variable_name still returns a plot).
- evals/tool_retrieval/: BM25 over the full ~54-function tool surface against 30 labeled prompts. Reports top-1 / top-3 / top-5 selection accuracy and mean rank of the correct tool. Currently 77% / 87% / 93%.

Both runners finish in under 30 seconds with no external dependencies. Result JSON files are gitignored; the runners are the source of truth.

evals/README.md explains what an eval is for a non-AI engineer and lists when to add new ones vs. when to write a unit test instead.
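As a rough illustration of how the headline number could be derived, the sketch below scores a list of per-call outcome labels. The function name, the label strings, and the split of caught calls across layers are assumptions made for this example, not the runner's actual code. The one thing taken from the reported figures is the denominator: 94.7% matches 18 of 19, so the two baseline calls appear to be excluded.

```python
from collections import Counter

# Hypothetical label set: any layer other than "silent" counts as caught.
CAUGHT_LAYERS = {"schema", "io", "runtime"}


def caught_rate(outcomes):
    """Fraction of malformed calls that some layer rejected (not silent)."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    caught = sum(1 for layer in outcomes if layer in CAUGHT_LAYERS)
    return caught / len(outcomes)


# Illustrative split only: the PR reports 19 malformed calls and 1 silent
# pass, but not how the 18 caught calls divide across layers.
outcomes = ["schema"] * 15 + ["io"] * 2 + ["runtime"] * 1 + ["silent"] * 1

print(Counter(outcomes))
print(f"caught_rate = {caught_rate(outcomes):.1%}")  # 18 of 19 calls
```

Keeping the silent bucket separate from the three rejecting layers is what makes a regression visible: a call that moves from "schema" to "silent" lowers the rate, while a call that moves between rejecting layers does not.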
…ually type
Targeted the 7 tools that ranked worst in the BM25 retrieval eval — rewrote
each first line to include the words a user would naturally use ("wireframe",
"colored map", "ensemble", "time average", "is the endpoint healthy", "start
a new session", "list variables") rather than internal jargon.
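To see why the first docstring line matters for lexical retrieval, here is a minimal, dependency-free sketch of Okapi BM25 ranking over one-line tool descriptions. The docstrings, the query, and the parameter values (k1=1.5, b=0.75) are invented for illustration and are not the repository's real text or settings; the eval's own BM25 implementation may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return doc names ordered by descending Okapi BM25 score."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs
    # Document frequency: in how many docs each term appears.
    df = Counter(term for t in tokens.values() for term in set(t))
    scores = {}
    for name, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical first lines: jargon-heavy vs. user-vocabulary wording.
before = {
    "plot_mesh": "Render the unstructured grid topology.",
    "calculate_temporal_mean": "Reduce a variable along the time axis.",
    "diagnose_endpoint": "Probe remote service status.",
}
after = {
    "plot_mesh": "Plot the mesh as a wireframe.",
    "calculate_temporal_mean": "Compute the time average of a variable.",
    "diagnose_endpoint": "Check whether the endpoint is healthy.",
}

query = "show me a wireframe of the mesh"
print("before:", bm25_rank(query, before))
print("after: ", bm25_rank(query, after))
```

With the jargon wording, the query only overlaps on filler words, so an unrelated tool can outrank the intended one; once the docstring contains "mesh" and "wireframe", the rare-term weight pushes plot_mesh to the top.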
evals/tool_retrieval results, same 30-prompt set:
before: top-1 77%, top-3 87%, top-5 93%, mean rank 2.33, worst rank 19
after: top-1 93%, top-3 100%, top-5 100%, mean rank 1.07, worst rank 2
The two remaining rank-2 cases are genuinely ambiguous (plot_mesh vs.
plot_mesh_geo; inspect_variable vs. get_capabilities) and the right ones
land in the top-3 shortlist — which is what discover_tools will return.
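The "after" numbers can be checked from the rank distribution they imply: 30 prompts, 28 with the correct tool at rank 1 and 2 at rank 2. The helper below is a sketch with hypothetical names, not the eval's actual code.

```python
def retrieval_metrics(ranks, ks=(1, 3, 5)):
    """Summarize 1-based ranks of the correct tool, one rank per prompt."""
    if not ranks:
        raise ValueError("no ranks to score")
    n = len(ranks)
    metrics = {f"top_{k}": sum(r <= k for r in ranks) / n for k in ks}
    metrics["mean_rank"] = sum(ranks) / n
    metrics["worst_rank"] = max(ranks)
    return metrics


# Distribution implied by the "after" results: 28 prompts at rank 1,
# 2 prompts at rank 2, out of 30.
after_ranks = [1] * 28 + [2] * 2
m = retrieval_metrics(after_ranks)

print(f"top-1 {m['top_1']:.0%}, top-3 {m['top_3']:.0%}, top-5 {m['top_5']:.0%}")
print(f"mean rank {m['mean_rank']:.2f}, worst rank {m['worst_rank']}")
```

This gives 28/30 for top-1 (93%) and a mean rank of 32/30 (1.07), consistent with the reported figures.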
Tools touched: create_session, calculate_temporal_mean, calculate_ensemble_mean,
diagnose_endpoint, inspect_variable, plot_mesh, plot_variable, plot_mesh_geo,
get_capabilities. Behavior unchanged; only the leading docstring sentence
moves.
Pre-commit (including mypy) and the full test suite (295 tests) pass.
Summary
Two cheap, runnable evals under `evals/` that turn behavior we care about into numbers we can re-measure on every PR.

- `evals/schema_rejection/`: 21 calls (19 deliberately malformed, 2 baselines). Classifies each outcome by layer (schema / IO / runtime / silent) and reports `caught_rate`. Currently 94.7% with 1 silent pass: `plot_dataset(plot_type='variable')` accepts a call with no `variable_name` and returns a plot anyway. That's a real bug surfaced by the eval; tracked separately, not fixed in this PR.
- `evals/tool_retrieval/`: BM25 over the full ~54-function tool surface against 30 labeled prompts. Reports top-1 / top-3 / top-5 selection accuracy and the mean rank of the correct tool. Currently 77% top-1, 87% top-3, 93% top-5.

Both runners complete in under 30 seconds with no external dependencies. Eval result JSON files are gitignored; the runners themselves are the source of truth.

`evals/README.md` explains what an eval is for a non-AI engineer and lists when to add one vs. when to write a unit test.

Test plan
- `uv run pre-commit run --all-files`: passes.
- `uv run pytest tests/ --ignore=tests/test_remote_agent.py`: 295 passed.
- `uv run python -m evals.schema_rejection.run`: completes; 1 known silent-pass bug reported.
- `uv run python -m evals.tool_retrieval.run`: completes; 77 / 87 / 93 numbers reproduce.